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Abstract 

We  propose  a  family  of  efficient  algorithms  for 
learning  the  parameters  of  a  Bayesian  network 
from  incomplete  data.  Our  approach  is  based 
on  recent  theoretical  analyses  of  missing  data 
problems,  which  utilize  a  graphical  representa¬ 
tion,  called  the  missingness  graph.  In  the  case 
of  MCAR  and  MAR  data,  this  graph  need  not 
be  explicit,  and  yet  we  can  still  obtain  closed- 
form,  asymptotically  consistent  parameter  esti¬ 
mates,  without  the  need  for  inference.  When  this 
missingness  graph  is  explicated  (based  on  back¬ 
ground  knowledge),  even  partially,  we  can  obtain 
even  more  accurate  estimates  with  less  data.  Em¬ 
pirically,  we  illustrate  how  we  can  learn  the  pa¬ 
rameters  of  large  networks  from  large  datasets, 
which  are  beyond  the  scope  of  algorithms  like 
EM  (which  require  inference). 

1  INTRODUCTION 

When  learning  the  parameters  of  a  Bayesian  network  from 
data  with  missing  values,  the  conventional  wisdom  among 
machine  learning  practitioners  is  that  there  are  two  options: 
use  expectation  maximization  (EM)  or  gradient  methods  (to 
optimize  the  likelihood);  see,  e.g.,  Darwiche  (2009),  Roller 
and  Friedman  (2009),  Murphy  (2012),  Barber  (2012).  Both 
of  these  approaches,  however,  suffer  from  the  following 
disadvantages,  which  prevent  them  from  scaling  to  large 
networks  and  datasets;  see  also  Thiesson,  Meek,  and  Heck- 
erman  (2001).  First,  they  are  iterative,  and  hence  may  need 
many  passes  over  a  potentially  large  dataset.  Next,  these  al¬ 
gorithms  may  get  stuck  in  local  optima ,  which  means  that, 
in  practice,  one  must  run  these  algorithms  multiple  times 
with  different  initial  seeds,  and  hope  that  one  of  them  leads 
to  a  good  optimum.  Last,  but  not  least,  these  methods  re¬ 
quire  inference  in  the  network,  which  places  a  hard  limit 
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on  the  networks  where  EM  and  gradient  methods  can  even 
be  applied,  namely  for  networks  where  exact  inference  is 
tractable,  i.e.,  they  have  small  enough  treewidth,  or  suffi¬ 
cient  local  structure  (Chavira  &  Darwiche,  2006,  2007). 

Recently,  Mohan,  Pearl,  and  Tian  (2013)  showed  that  the 
joint  distribution  of  a  Bayesian  network  is  recoverable  from 
incomplete  data,  including  data  that  falls  under  the  clas¬ 
sical  missing  at  random  assumption  (MAR),  but  also  for 
a  broad  class  of  data  that  is  not  MAR.  Their  analysis  is 
based  on  a  graphical  representation  for  missing  data  prob¬ 
lems,  called  the  missingness  graph,  where  one  explicates 
the  causal  mechanisms  that  are  responsible  for  the  miss¬ 
ingness  in  an  incomplete  dataset.  Using  this  representation, 
they  provide  a  way  to  decide  whether  a  given  query  (e.g., 
a  joint  marginal)  is  recoverable,  and  if  so,  they  provide  a 
closed-form  expression  (in  terms  of  the  observables)  for  an 
asymptotically  consistent  estimate. 

Building  on  the  theoretical  foundations  set  by  Mohan  et  al. 
(2013),  we  propose  a  family  of  practical  and  efficient  al¬ 
gorithms  for  estimating  the  parameters  of  a  Bayesian  net¬ 
work  from  incomplete  data.  For  the  cases  of  both  MCAR 
and  MAR  data,  where  the  missingness  graph  need  not  be 
explicit,  we  start  by  deriving  the  closed-form  parameter 
estimates,  as  implied  by  Mohan  et  al.  (2013).  We  next 
show  how  to  obtain  better  estimates,  by  exploiting  a  factor¬ 
ized  representation  that  allows  us  to  aggregate  distinct,  yet 
asymptotically  equivalent  estimates,  hence  utilizing  more 
of  the  data.  We  also  show  how  to  obtain  improved  es¬ 
timates,  when  the  missingness  graph  is  only  partially  ex¬ 
plicated  (based  on  domain  or  expert  knowledge).  As  in 
Mohan  et  al.  (2013),  all  of  our  estimation  algorithms  are 
asymptotically  consistent,  i.e.,  they  converge  to  the  true  pa¬ 
rameters  of  a  network,  in  the  limit  of  infinite  data. 

As  we  show  empirically,  our  parameter  estimation  algo¬ 
rithms  make  learning  from  incomplete  data  viable  for  larger 
Bayesian  networks  and  larger  datasets,  that  would  other¬ 
wise  be  beyond  the  scope  of  algorithms  such  as  EM  and 
gradient  methods.  In  particular,  our  algorithms  (1)  are  non¬ 
iterative,  requiring  only  a  single  pass  over  the  data,  (2)  pro¬ 
vide  estimates  in  closed-form,  and  hence  do  not  suffer  from 
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(a)  Dataset  T>  and  DAG  (b)  Missingness  Dataset  T>  and  DAG 


noted  by  Ry=  ob.  Variable  V*  acts  as  a  proxy  on  the  value 
of  Y,  which  may  be  an  observed  value  y,  or  a  special  value 
(mi)  when  the  value  of  Y  is  missing.  The  value  of  Y*  thus 
depends  functionally  on  variables  Ry  and  Y: 


Y*  =  f(Ry,  Y) 


mi  if  Ry  =  unob 
Y  if  Ry  =  ob 


Figure  1:  Datasets  and  DAGs. 


That  is,  when  Ry=  unob,  then  Y*—  mi;  otherwise  Ry=  ob 
and  the  proxy  Y*  assumes  the  observed  value  of  Y . 


local  optima,  and  (3)  require  no  inference,  which  is  the  pri¬ 
mary  limiting  factor  for  the  scalability  of  algorithms  such 
as  EM.  We  note  that  these  advantages  are  also  available 
when  learning  Bayesian  networks  from  complete  data. 

2  TECHNICAL  PRELIMINARIES 

In  this  paper,  we  use  upper  case  letters  ( X )  to  denote  vari¬ 
ables  and  lower  case  letters  ( x )  to  denote  their  values.  Vari¬ 
able  sets  are  denoted  by  bold-face  upper  case  letters  (X) 
and  their  instantiations  by  bold-face  lower  case  letters  (x). 
Generally,  we  will  use  X  to  denote  a  variable  in  a  Bayesian 
network  and  U  to  denote  its  parents.  A  network  parameter 
will  therefore  have  the  general  form  0xiu,  representing  the 
probability  Pr(X=x  |U=u). 

Given  an  incomplete  dataset  D,  we  want  to  learn  the  pa¬ 
rameters  of  the  Bayesian  network  Af  that  the  dataset  orig¬ 
inated  from.  This  network  induces  a  distribution  Pr(X), 
which  is  in  general  unknown;  instead,  we  only  have  access 
to  the  dataset  D. 

2.1  MISSING  DATA:  AN  EXAMPLE 

As  an  illustrative  example,  consider  Figure  1(a),  depicting 
a  dataset  X>,  and  the  directed  acyclic  graph  (DAG)  G  of  a 
Bayesian  network,  both  over  variables  A'  and  Y.  Here,  the 
value  for  variable  X  is  always  observed  in  the  data,  while 
the  value  for  variable  Y  can  be  missing.  In  the  graph,  we 
denote  a  variable  that  is  always  observed  with  a  double¬ 
circle.  Now,  if  we  happen  to  know  the  mechanism  that 
causes  the  value  of  Y  to  become  missing  in  the  data,  we 
can  include  it  in  our  model,  as  in  Figure  1(b).  Here,  we 
use  a  variable  Ry  to  represent  the  mechanism  that  controls 
whether  the  value  of  variable  Y  is  missing  or  observed. 
Further,  we  witness  the  value  of  Y ,  or  its  missingness, 
through  a  proxy  variable  Y* ,  as  an  observation.  Such  a 
graph,  which  explicates  the  missing  data  process,  is  called 
a  missingness  graph. 

In  our  example,  we  augmented  the  dataset  and  graph  with 
new  variables  Ry,  representing  the  causal  mechanism  that 
dictates  the  missingness  of  the  value  of  Y.  This  mechanism 
can  be  active  (Y  is  unobserved),  denoted  by  Ry=  unob. 
Otherwise,  the  mechanism  is  passive  (Y  is  observed),  de¬ 


Figure  2:  An  MNAR  missingness  graph. 

Figure  2  highlights  a  more  complex  example  of  a  missing¬ 
ness  graph,  with  causal  mechanisms  Rx  and  Ry  that  de¬ 
pend  on  other  variables.  In  Section  2.3,  we  highlight  how 
different  missing  data  problems  (such  as  MCAR  and  MAR) 
lead  to  different  types  of  missingness  graphs. 

2.2  LEARNING  WITH  MISSINGNESS  GRAPHS 

Given  a  Bayesian  network  with  DAG  G,  and  an  incomplete 
dataset  D,  we  can  partition  the  variables  X  into  two  sets: 
the  fully-observed  variables  XQ,  and  the  partially-observed 
variables  Xm  that  have  missing  values  in  D.  As  in  our  ex¬ 
ample  above,  one  can  take  into  account  knowledge  about 
the  processes  responsible  for  the  missingness  in  T>.  More 
specifically,  we  can  incorporate  the  causal  mechanisms  that 
cause  the  variables  Xm  to  have  missing  values,  by  intro¬ 
ducing  (1)  variables  R  representing  the  causal  mechanisms 
that  are  responsible  for  missingness  in  the  data,  and  (2) 
variables  X*n  that  act  as  proxies  to  the  variables  XTO.  This 
augmented  Bayesian  network,  which  we  refer  to  as  the 
missingness  graph  A f* ,  has  variables  X0,X*l7R  that  are 
fully-observed,  and  variables  Xm  that  are  only  partially- 
observed.  The  missingness  graph  Af*  thus  induces  a  distri¬ 
bution  Pr(X0,  Xm,  X^,  R).  Using  the  missingness  graph, 
we  want  to  draw  conclusions  about  the  partially-observed 
variables,  by  reasoning  about  the  fully-observed  ones. 

Missingness  graphs  can  serve  as  a  powerful  tool  for  analyz¬ 
ing  missing  data  problems;  see,  e.g.,  Thoemmes  and  Mo¬ 
han  (2015),  Francois  and  Leray  (2007),  Darwiche  (2009), 
Roller  and  Friedman  (2009).  As  Mohan  et  al.  (2013)  show, 
one  can  exploit  the  conditional  independencies  that  miss¬ 
ingness  graphs  encode,  in  order  to  extract  asymptotically 
consistent  estimates  for  missing  data  problems,  including 
MNAR  ones,  whose  underlying  assumptions  would  put  it 
out  of  the  scope  of  existing  techniques.1  Mohan  et  al. 

'Note  that  maximum-likelihood  estimation  is  asymptotically 
consistent,  although  a  consistent  estimator  is  not  necessarily  a 
maximum-likelihood  estimator;  see,  e.g.,  Wasserman  (2011). 


(2013)  identify  conditions  on  AT*  that  allow  the  original, 
partially-observed  distribution  Pr(XD,XTO)  to  be  identi¬ 
fied  from  the  fully-observed  distribution  Pr(XOJ  X*n,  R). 
However,  in  practice,  we  only  have  access  to  a  dataset  CD, 
and  the  corresponding  data  distribution  that  it  induces: 

Pr©(x0,x^,r)  = 

where  N  is  the  number  of  instances  in  dataset  CD,  and 
where  D#(x)  is  the  number  of  instances  where  instanti¬ 
ation  x  appears  in  the  data.2  However,  the  data  distribu¬ 
tion  Pr©  tends  to  the  true  distribution  Pr  (over  the  fully- 
observed  variables),  as  N  tends  to  infinity. 

Building  on  the  theoretical  foundations  set  by  Mohan  et  al. 
(2013),  we  shall  propose  a  family  of  efficient  and  scalable 
parameter  estimation  algorithms  from  incomplete  data.  In 
essence,  we  will  show  how  to  query  the  observed  data 
distribution  Pr©,  in  order  to  make  inferences  about  the 
true,  underlying  distribution  Pr(X0,XTO)  (in  particular, 
we  want  the  conditional  probabilities  that  parameterize  the 
given  Bayesian  network).  As  we  shall  discuss,  in  many 
cases  the  missingness  graph  need  not  be  explicit.  In  other 
cases,  when  there  is  knowledge  about  the  missingness 
graph,  even  just  partial  knowledge,  we  can  exploit  it,  in 
order  to  obtain  more  accurate  parameter  estimates. 

2.3  CATEGORIES  OF  MISSINGNESS 

An  incomplete  dataset  is  categorized  as  Missing  Com¬ 
pletely  At  Random  (MCAR)  if  all  mechanisms  R  that  cause 
the  values  of  variables  Xm  to  go  missing,  are  marginally 
independent  of  X,  i.e.,  where  (Xm,XG)  _LLR.  This  cor¬ 
responds  to  a  missingness  graph  where  no  variable  in 
Xm  U  XQ  is  a  parent  of  any  variable  in  R.  For  example, 
if  all  mechanisms  R  are  root  nodes,  then  the  problem  is 
MCAR.  Note  that  the  missingness  graph  of  Figure  1(b)  im¬ 
plies  an  MCAR  dataset. 

An  incomplete  dataset  is  categorized  as  Missing  At  Ran¬ 
dom  (MAR)  if  missingness  mechanisms  are  conditionally 
independent  of  the  partially-observed  variables  given  the 
fully-observed  variables,  i.e.,  if  XTO  _LL  R  |  XQ.  This  cor¬ 
responds  to  a  missingness  graph  where  variables  R  are  al¬ 
lowed  to  have  parents,  as  long  as  none  of  them  are  partially- 
observed.  In  the  example  missingness  graph  of  Figure  1(b), 
adding  an  edge  X  — >  Ry  results  in  a  graph  that  yields 
MAR  data.  This  is  a  stronger,  variable-level  definition 
of  MAR,  which  has  previously  been  used  in  the  machine 
learning  literature  (Darwiche,  2009;  Roller  &  Friedman, 
2009),  in  contrast  to  the  event-level  definition  of  MAR  that 
is  prevalent  in  the  statistics  literature  (Rubin,  1976). 

2Note  that  the  data  distribution  is  well-defined  over  the  vari¬ 
ables  X0,  XJjj  and  R,  as  they  are  fully-observed  in  the  augmented 
dataset,  and  that  Pr©  can  be  represented  compactly  in  space  lin¬ 
ear  in  N,  as  we  need  not  explicitly  represent  those  instantiations 
x  that  were  not  observed  in  the  data. 


Table  1 :  Summary  of  Estimation  Algorithms 


Algorithm 

Description  (Section  Number) 

D-MCAR 

Direct  Deletion  for  MCAR  data  (3.1) 

D-MAR 

Direct  Deletion  for  MAR  data  (3.2) 

F-MCAR 

Factored  Deletion  for  MCAR  data  (3.3) 

F-MAR 

Factored  Deletion  for  MAR  data  (3.3) 

I-MAR 

Informed  Deletion  for  MAR  data  (5.1) 

IF-MAR 

Informed  Factored  Deletion  for  MAR  data  (5.1) 

An  incomplete  dataset  is  categorized  as  Missing  Not  At 
Random  (MNAR)  if  it  is  not  MAR  (and  thus  not  MCAR). 
For  example,  the  DAG  in  Figure  2  corresponds  to  an 
MNAR  missingness  graph.  This  is  because  the  mechanism 
Rx  has  a  partially-observed  variable  as  a  parent;  further, 
mechanism  Ry  has  a  partially-observed  parent  X. 

3  CLOSED-FORM  LEARNING 

We  now  present  algorithms  to  learn  the  parameters  of  a 
Bayesian  network  AT  from  data  If  We  first  consider  the 
classical  missing  data  assumptions,  with  no  further  knowl¬ 
edge  about  the  missingness  graph  that  generated  the  data. 

To  estimate  the  conditional  probabilities  0:r\xl  that  parame¬ 
terize  a  Bayesian  network,  we  estimate  the  joint  distribu¬ 
tions  Pr(X,  U),  which  are  subsequently  normalized,  as  a 
conditional  probability  table.  Hence,  it  suffices,  for  our 
discussion,  to  estimate  marginal  distributions  Pr(Y)  for 
families  Y  =  {X}  U  U.  We  let  Y0  =  Y  n  X0  de¬ 
note  the  observed  variables  in  Y,  and  Ym  =  Y  FI  Xm 
denote  the  partially-observed  variables.  Further,  we  let 
Rz  C  R  denote  the  missingness  mechanisms  for  the 
partially-observed  variables  Z.  Through  If  we  have  access 
to  the  data  distribution  Pr©  over  the  variables  in  the  miss¬ 
ingness  dataset.  Appendix  D  illustrates  our  learning  algo¬ 
rithms  on  a  concrete  dataset  and  Table  1  gives  an  overview 
of  the  different  estimation  algorithms  in  this  paper. 

3.1  DIRECT  DELETION  FOR  MCAR 

The  statistical  technique  of  listwise  deletion  is  perhaps  the 
simplest  technique  for  performing  estimation  with  MCAR 
data:  we  simply  delete  all  instances  in  the  dataset  that 
contain  missing  values,  and  estimate  our  parameters  from 
the  remaining  dataset,  which  is  now  complete.  Of  course, 
with  this  technique,  we  potentially  ignore  large  parts  of 
the  dataset.  The  next  simplest  technique  is  perhaps  pair¬ 
wise  deletion,  or  available-case  analysis:  when  estimating 
a  quantity  over  a  pair  of  variables  X  and  Y ,  we  delete  just 
those  instances  where  variable  X  or  variable  Y  is  missing. 

Consider  now  the  following,  more  general,  deletion  tech¬ 
nique,  which  is  expressed  in  the  terms  of  causal  missing¬ 
ness  mechanisms.  In  particular,  to  estimate  the  marginals 
Pr(Y)  of  a  set  of  (family)  variables  Y,  from  the  data  dis- 


tribution  Pi'd,  we  can  use  the  estimate: 

Pr(Y)  =  Pr(Y0,  Ym|RYm=ob)  by  X0,Xm  ±L  R 
=  Pr(Y0,  Y*JRYm=ob)  by  Xm=X*m  when  R=ob 
~  Pr»(Y0,  Y*JRYm=ob) 

That  is,  we  can  estimate  Pr(Y)  by  using  the  subset  of  the 
data  where  every  variable  in  Y  is  observed  (which  follows 
from  the  assumptions  implied  by  MCAR  data).  Since  the 
data  distribution  Pr-x>  tends  to  the  true  distribution  Pr,  this 
implies  a  consistent  estimate  for  the  marginals  Pr(Y).  In 
contrast,  the  technique  of  listwise  deletion  corresponds  to 
the  estimate  Pr(Y)  «  Pru(Y0,  Y^l|Rxm=ob),  and  the 
technique  of  pairwise  deletion  corresponds  to  the  above, 
when  Y  contains  two  variables.  To  facilitate  comparisons 
with  more  interesting  estimation  algorithms  that  we  shall 
subsequently  consider,  we  refer  to  the  more  general  esti¬ 
mation  approach  above  as  direct  deletion. 

3.2  DIRECT  DELETION  FOR  MAR 

In  the  case  of  MAR  data,  we  cannot  use  the  simple  dele¬ 
tion  techniques  that  we  just  described  for  MCAR  data — 
the  resulting  estimates  would  not  be  consistent.  However, 
we  show  next  that  it  is  possible  to  obtain  consistent  esti¬ 
mates  from  MAR  data,  using  a  technique  that  is  as  simple 
and  efficient  as  direct  deletion.  Roughly,  we  can  view  this 
technique  as  deleting  certain  instances  from  the  dataset,  but 
then  re-weighting  the  remaining  ones,  so  that  a  consistent 
estimate  is  obtained.  We  shall  subsequently  show  how  to 
obtain  even  better  estimates  by  factorization. 

Again,  to  estimate  network  parameters  9X\U,  it  suffices  to 
show  how  to  estimate  family  marginals  Pr(Y),  now  under 
the  MAR  assumption.  Let  X'Q  =  Xa  \  Y0  denote  the  fully- 
observed  variables  outside  of  the  family  variables  Y  (i.e., 
X0  =  Y0  U  X'J.  We  have 

Pr(Y)=^Pr(Y0,Ym,X'0) 

Xo 

=  ^Pr(Ym|Y0,X;)Pr(Y0,X'0) 

X' 

Hence,  we  reduced  the  problem  to  estimating  two  sets  of 
probabilities.  Estimating  the  probabilities  Pr(Y0,Xg)  is 
straightforward,  as  variables  Y0  and  X'a  are  fully  observed 
in  the  data.  The  conditional  probabilities  Pr(Ym|Y0,  XJ,) 
contain  partially  observed  variables  Ym,  but  they  are  con¬ 
ditioned  on  all  fully  observed  variables  XQ  =  Y0  U  X'r 
The  MAR  definition  implies  that  each  subset  of  the  data 
that  fixes  a  value  for  XQ  is  locally  MCAR.  Like  the  MCAR 
case,  we  can  estimate  each  conditional  probability  as 

Pr(Ym|Y0,X'0)  =  Pr(YJY0,X'0,RYm=ob). 

This  leads  to  the  following  estimation  algorithm, 

Pr(Y)  «  £Prc(Y£,|Y0,X'0,RY.  =ob)  PrD(YQ,  X'J 

X' 


Algorithm  1  F-MCAR(y,D) 

Input: 

y:  A  state  of  query  variables  Y 

T>:  An  incomplete  dataset  with  data  distribution  Pr© 

Auxiliary: 

Cache:  A  global  cache  of  estimated  probabilities 

Function: 

1:  if  y  =  0  then  return  1 

2:  if  CACHE[y]  /  nil  then  return  CACHEfy] 

3:  £  <—  0  //  Initialize  set  of  estimates 

4:  for  each  y  £  y  do 

5:  u  4—  y  \  {y}  //  Factorize  with  parents  u 

6:  add  PrD(y|u,  Ry=ob)  •  F-MCAR(u,  D)  to  £ 

7:  CACHE[y]  4— Aggregate  estimates  in  £  //  E.g.,  mean 

8:  return  Cache [y] 


Figure  3:  Factorization  Lattice  of  Pr(X,  Y.  Z ) 


which  uses  only  the  fully-observed  variables  of  the  data 
distribution  Pit) .  Note  that  the  summation  requires  only  a 
single  pass  through  the  data,  i.e.,  for  only  those  instantia¬ 
tions  of  X'a  that  appear  in  it.  Again,  Pi  d  tends  to  the  true 
distribution  Pr,  as  the  dataset  size  tends  to  infinity,  imply¬ 
ing  a  consistent  estimate  of  Pr(Y). 

3.3  FACTORED  DELETION 

We  now  propose  a  class  of  deletion  algorithms  that  exploit 
more  data  than  direct  deletion.  In  the  first  step,  we  generate 
multiple  but  consistent  estimates  for  the  query  so  that  each 
estimates  utilizes  different  parts  of  a  dataset  to  estimate  the 
query.  In  the  second  step,  we  aggregate  these  estimates  to 
compute  the  final  estimate  and  thus  put  to  use  almost  all 
tuples  in  the  dataset.  Since  this  method  exploits  more  data 
than  direct  deletion,  it  obtains  a  better  estimate  of  the  query. 

Factored  Deletion  for  MCAR  Algorithm  1  implements 
factored  deletion  for  MCAR.  Let  the  query  of  interest  be 
Pr(Y),  and  let  Y1,  Y2, . . . ,  Yn  be  any  ordering  of  the  n 


variables  in  Y.  Each  ordering  yields  a  unique  factorization: 

n 

Pr(Y)  =  Y[  Pr  (Yi  |  Yi+1, . . . ,  Yn) 

i—1 

We  can  estimate  each  of  these  factors  independently,  on 
the  subset  of  the  data  in  which  all  of  its  variables  are  fully 
observed  (as  in  direct  deletion),  i.e., 

Pr(Y*|Yi+1,  ...,!£)=  Pr(Yi|Yi+1, . . . ,  Y£,  Rz>  =  ob) 

where  Z®  is  the  set  of  partially-observed  variables  in  the 
factor.  When  |Ym|  >  1,  we  can  utilize  much  more  data 
than  direct  deletion.  See  Appendix  D,  for  an  example. 

So  far,  we  have  discussed  how  a  consistent  estimate  of 
Pr(Y)  may  be  computed  given  a  factorization.  Now  we 
shall  detail  how  estimates  from  each  factorization  can  be 
aggregated  to  compute  more  accurate  estimates  of  Pr(Y). 
Let  k  be  the  number  of  variables  in  a  family  Y.  The  num¬ 
ber  of  possible  factorizations  is  k\.  However,  different  fac¬ 
torizations  share  the  same  sub-factors,  which  we  can  es¬ 
timate  once,  and  reuse  across  factorizations.  We  can  or¬ 
ganize  these  computations  using  a  lattice,  as  in  Figure  3, 
which  has  only  2k  nodes  and  A’2fc_1  edges.  Our  algorithm 
will  compute  as  many  estimates  as  there  are  edges  in  this 
lattice,  which  is  only  on  the  order  of  O(nlogn),  where  n 
is  the  number  of  parameters  being  estimated  for  a  family 

Y  (which  is  also  exponential  in  the  number  of  variables  k). 
To  emphasize  the  distinction  with  direct  deletion,  which 
uses  only  those  instances  in  the  data  where  all  variables  in 

Y  are  observed,  factored  deletion  uses  any  instance  in  the 
data  where  at  least  one  variable  in  Y  is  observed. 

More  specifically,  our  factored  deletion  algorithm  first  esti¬ 
mates  the  conditional  probabilities  on  the  edges  of  the  lat¬ 
tice,  each  estimate  using  the  subset  of  the  data  where  its 
variables  are  observed.  Second,  it  propagate  the  estimates, 
bottom-up.  For  each  node,  there  are  several  alternative  es¬ 
timates  available,  on  its  incoming  edges.  There  are  various 
ways  of  aggregating  these  estimates,  such  as  mean,  median, 
and  propagating  the  lowest-variance  estimate.3 

Factored  Deletion  for  MAR  Algorithm  2  implements 
factored  deletion  for  MAR.  Let  Yr[n ,  Vj3 , . . . ,  be  any  or¬ 
dering  of  the  n  partially  observed  variables  Ym  C  Y  and 
let  Xq  =  X0  \  Y0  denote  the  fully-observed  variables  out¬ 
side  of  Y.  Given  an  ordering,  we  have  the  factorization: 

n 

Pr(Y)  =  £  Pr(Y0,  X' )  JJ  Pr  (*£  |  Zj+\  XQ) 

x'o  »=i 

where  Zlm  =  {  Y^  |  i  <  j  <  n}.  We  then  proceed  in  a  man¬ 
ner  similar  to  factored  deletion  for  MCAR  to  estimate  indi¬ 
vidual  factors  and  aggregate  estimates  to  compute  Pr(Y). 
For  equations  and  derivations,  please  see  Appendix  A. 

3  In  initial  experiments,  all  aggregations  performed  similarly. 
Reported  results  use  an  inverse-variance  weighting  heuristic. 


Algorithm  2  F-MAR(y,  D) 

Input: 

y:  A  state  of  query  variables  Y,  consisting  of  y0  and  ym 
T>:  An  incomplete  dataset  with  data  distribution  Pr© 

Function: 

1:  e  <—  0  //  Estimated  probability 

2:  for  each  xD  appearing  in  D  that  agrees  with  yQ  do 
3:  T>Xo  <—  subset  of  D  where  xc  holds 

4:  e  <-  e  +  Ppd(x0)  •  F-MCAR(yTO,DxJ 

5:  return  e 


4  EMPIRICAL  EVALUATION 

To  evaluate  the  learning  algorithms  we  proposed,  we  sim¬ 
ulate  partially  observed  datasets  from  Bayesian  networks, 
and  re-learn  their  parameters  from  the  data.4 

In  our  first  sets  of  experiments,  we  compare  our  parame¬ 
ter  estimation  algorithms  with  EM,  on  relatively  small  net¬ 
works  for  MCAR  and  MAR  data.  These  experiments  are 
intended  to  observe  general  trends  in  our  algorithms,  in 
terms  of  their  computational  efficiency,  but  also  in  terms  of 
the  quality  of  the  parameter  estimates  obtained.  Our  main 
empirical  contributions  are  presented  in  Section  4.3,  where 
we  demonstrate  the  scalability  of  our  proposed  estimation 
algorithms,  to  larger  networks  and  datasets,  compared  to 
EM  (even  when  using  approximate  inference  algorithms). 

We  consider  the  following  algorithms: 

D-MCAR  &  F-MCAR:  direct  deletion  and  factored  dele¬ 
tion  for  MCAR  data. 

D-MAR  &  F-MAR:  direct  deletion  and  factored  deletion 
for  MAR  data. 

EM-A-JT:  EM  with  k  random  restarts,  jointree  inference. 

F-MAR  +  EM-JT:  EM  seeded  with  F-MAR  estimates, 
jointree  inference. 

Remember  that  D-MCAR  and  F-MCAR  are  consistent  for 
MCAR  data  only,  while  D-MAR  and  F-MAR  are  consis¬ 
tent  for  general  MAR  data.  EM  is  consistent  for  MAR  data, 
but  only  if  it  converges  to  maximum-likelihood  estimates. 

We  evaluate  the  learned  parameters  in  terms  of  their  like¬ 
lihood  on  independently  generated,  fully-observed  test 
data,  and  the  Kullback-Leibler  divergence  (KLD)  between 
the  original  and  learned  Bayesian  networks.  We  report 
per-instance  log-likelihoods  (which  are  divided  by  dataset 
size).  We  evaluate  the  learned  models  on  unseen  data,  so 
all  learning  algorithms  assume  a  symmetric  Dirichlet  prior 

4An  implementation  of  our  system  is  available  at  http:  // 
reasoning .cs.ucla.edu/deletion. 


Figure  4:  Learning  the  alarm  network  from  MCAR  data. 

on  the  network  parameters  with  a  concentration  parameter 
of  2  (which  corresponds  to  Laplace  smoothing). 

4.1  MCAR  DATA 

First,  we  consider  learning  from  MCAR  data,  evaluating 
the  quality  of  the  parameters  learned  by  each  algorithm. 
We  simulate  training  sets  of  increasing  size,  from  a  given 
Bayesian  network,  selecting  30%  of  the  variables  to  be  par¬ 
tially  observed,  and  removing  70%  of  their  values  com¬ 
pletely  at  random.  All  reported  numbers  are  averaged  over 
32  repetitions  with  different  learning  problems.  When  no 
number  is  reported,  a  5  minute  time  limit  was  exceeded. 

To  illustrate  the  trade-off  between  data  and  computational 
resources,  Figure  4  plots  the  KLDs  as  a  function  of  dataset 
size  and  time;  further  results  are  provided  in  Table  5  of 
Appendix  B.  First,  we  note  that  in  terms  of  the  final  esti¬ 
mates  obtained,  there  is  no  advantage  in  running  EM  with 
restarts:  EM-l-JT  and  EM-10-JT  learn  almost  identical 
models.  This  indicates  that  the  likelihood  landscape  for 
MCAR  data  has  few  local  optima,  and  is  easy  to  optimize. 
Hence,  EM  may  be  obtaining  maximum-likelihood  esti¬ 
mates  in  these  cases.  In  general,  maximum-likelihood  esti¬ 
mators  are  more  statistically  efficient  (asymptotically)  than 
other  estimators,  i.e.,  they  require  fewer  samples.  However, 
other  estimators  (such  as  method-of-moments)  can  be  more 
computationally  efficient;  see,  e.g.,  Wasserman  (2011).  We 
also  observe  this  trend  here.  EM  obtains  better  estimates 
with  smaller  datasets,  with  smaller  KLDs.  However,  direct 
and  factored  deletion  (D-MCAR  and  F-MCAR)  are  both 
orders-of-magnitude  faster,  and  can  scale  to  much  larger 


datasets,  than  EM  (which  requires  inference).  Further,  F- 
MCAR  needs  only  a  modest  amount  of  additional  data  to 
obtain  comparable  estimates. 

To  compare  our  direct  and  factored  methods,  we  see  that 
F-MCAR  is  slower  than  D-MCAR,  as  it  estimates  more 
quantities  (one  for  each  lattice  edge).  F-MCAR  learns  bet¬ 
ter  models,  however,  as  it  uses  a  larger  part  of  the  available 
data.  Finally,  D-MAR  performs  worse  than  F-MCAR  and 
D-MCAR,  as  it  assumes  the  weaker  MAR  assumption.  All 
learners  are  consistent,  as  all  KLDs  converge  to  zero. 

4.2  MAR  DATA 

Next,  we  consider  the  more  challenging  problem  of  learn¬ 
ing  from  MAR  data,  which  we  generate  as  follows:  (1)  se¬ 
lect  an  m-fraction  of  the  variables  to  be  partially  observed, 
(2)  add  a  missingness  mechanism  variable  Rx  for  each 
partially-observed  variable  X,  (3)  assign  p  parents  to  each 
Rx,  randomly  selected  from  the  set  of  observed  variables, 
giving  preference  to  neighbors  of  X  in  the  network,  (4) 
sample  parameters  for  the  missingness  mechanism  CPTs 
from  a  Beta  distribution,  (5)  sample  a  complete  dataset  with 
Rx  values,  and  (6)  hide  values  of  A'  accordingly. 

For  our  first  MAR  experiment,  we  use  a  small  network  that 
is  tractable  enough  for  EM  to  scale  to  large  dataset  sizes, 
so  that  we  can  observe  trends  in  this  regime.  Figure  5(a) 
shows  KLD  for  the  fire  alarm  network,  which  has 
only  6  variables  (and  hence,  the  complexity  of  inference  is 
negligible).  The  missing  data  mechanisms  were  generated 
with  to  =  0.3,  p  =  2,  and  a  Beta  distribution  with  shape 
parameters  1.0  and  0.5.  All  numbers  are  averaged  over  64 
repetitions  with  different  random  learning  problems.5 

There  is  a  significant  difference  between  EM,  with  and 
without  restarts,  indicating  that  the  likelihood  landscape 
is  challenging  to  optimize  (compared  to  MCAR,  which 
we  just  evaluated).  EM-10-JT  performs  well  for  small 
dataset  sizes,  but  stops  converging  after  around  1,000  in¬ 
stances.  This  could  be  due  to  all  restarts  getting  stuck  in 
local  optima.  The  KLD  of  F-MAR  starts  off  between  EM- 
l-JT  and  EM-10-JT  for  small  sizes,  but  quickly  outper¬ 
forms  EM.  For  the  largest  dataset  sizes,  it  learns  networks 
whose  KLD  is  two  orders  of  magnitude  smaller  than  EM- 
10-JT.  The  KLD  improves  further  when  we  use  F-MAR 
estimates  to  seed  EM.  This  approach  is  on  par  with  EM- 10 
for  small  datasets,  while  still  converging  for  large  dataset 
sizes.  However,  note  that  using  F-MAR  to  seed  EM  will  not 
be  practical  for  larger  networks,  where  inference  becomes  a 

5 On  our  chosen  parameters:  (1)  the  number  of  repetitions  was 
chosen  to  produce  smooth  learning  curves;  (2)  a  Beta  distribution 
with  shape  parameter  1  is  uniform,  and  with  parameter  0.5,  it  is 
slightly  biased  (so  that  it  acts  more  like  an  MAR,  and  less  like  an 
MCAR.  mechanism);  (3)  m  =  0.3  corresponds  to  a  low  amount 
of  missing  data,  and  later  m  =  0.9  corresponds  to  high  amount; 
and  (4)  p  =  2  encourages  sparsity  and  keeps  the  CPTs  small, 
although  setting  p  to  1  or  3  does  not  change  the  results. 


(a)  KL  Divergence  -  Fire  Alarm 


(b)  Likelihood  vs.  Size  -  Alarm  (c)  Likelihood  vs.  Time  -  Alarm 


Figure  5:  Learning  small,  tractable  Bayesian  networks  from  MAR  data.  The  legend  is  given  in  sub-figure  (b). 


bottleneck.  D-MCAR  and  F-MCAR  are  not  consistent  for 
MAR  data,  and  indeed  converge  to  a  biased  estimate  with 
a  KLD  around  0. 1 .  Finally,  we  observe  that  the  factorized 
algorithms  generally  outperform  their  direct  counterparts. 

For  our  second  MAR  experiment,  we  work  with  the  classi¬ 
cal  alarm  network,  which  has  37  variables.  The  missing 
data  mechanisms  were  generated  with  m  =  0.9,  p  =  2,  and 
a  Beta  distribution  with  shape  parameters  0.5.  All  reported 
numbers  are  averaged  over  32  repetitions,  and  when  no 
number  is  reported,  a  10  minute  time  limit  was  exceeded. 

Figures  5(b)  and  5(c)  show  test  set  likelihood  as  a  function 
of  dataset  size  and  learning  time.  EM-10-JT  performs  well 
for  very  small  dataset  sizes,  and  again  outperforms  EM- 
1-JT.  However,  inference  time  is  non-negligible  and  EM- 
10-JT  fails  to  scale  beyond  1,000  instances,  whereas  EM- 
1-JT  scales  to  10,000  (as  one  would  expect).  The  closed- 
form  learners  dominate  all  versions  of  EM  as  a  function 
of  time,  and  scale  to  dataset  sizes  that  are  two  orders  of 
magnitude  larger.  EM  seeded  by  F-MAR  achieves  sim¬ 
ilar  quality  to  EM-10-JT,  while  being  significantly  faster 
than  EM  learners  with  random  seeds.  D-MAR  and  F-MAR 
are  more  computationally  efficient,  and  can  scale  to  much 
larger  dataset  sizes.  Further,  as  seen  in  Figure  5(c),  they 
can  obtain  good  likelihoods  even  before  the  EM  methods 
report  their  first  likelihoods. 

4.3  SCALING  TO  LARGER  NETWORKS 

In  our  last  set  of  experiments  of  this  section,  we  evaluate 
our  algorithms  on  their  ability  to  scale  to  larger  networks, 
with  higher  treewidths,  where  exact  inference  is  more  chal¬ 
lenging.6  Again,  inference  is  the  main  factor  that  limits  the 
scalability  of  algorithms  such  as  EM,  to  larger  networks 
and  datasets  (EM  invokes  inference  as  a  sub-routine,  once 
per  data  instance,  per  iteration).  Tables  2  &  3  report  results 
on  four  networks,  where  we  simulated  MAR  datasets,  as 
in  the  previous  set  of  experiments.  Each  method  is  given  a 
time  limit  of  5  or  25  minutes.  Appendix  C  provides  results 
on  additional  settings.  We  consider  the  following  methods: 

6The  grid  network  has  400  variables,  muninl  has  189  vari¬ 
ables,  water  has  32  variables,  and  barley  has  48  variables. 


EM-JT  The  EM-10-JT  algorithm  used  in  anytime  fashion, 
which  returns,  given  a  time  limit,  the  best  parameters 
found  in  any  restart,  even  if  EM  did  not  converge. 

EM-BP  A  variant  of  EM-JT  that  uses  (loopy)  belief  prop¬ 
agation  for  (approximate)  inference  (in  the  E-step). 

We  see  that  EM-JT,  which  performs  exact  inference,  does 
not  scale  well  to  these  networks.  This  problem  is  mitigated 
by  EM-BR  which  performs  approximate  inference,  yet  we 
find  that  it  also  has  difficulties  scaling  (dashed  entries  in¬ 
dicate  that  EM-JT  and  EM-BP  did  not  finish  1  iteration  of 
EM).  In  contrast,  F-MAR,  and  particularly  D-MAR,  can 
scale  to  much  larger  datasets.  This  efficiency  is  due  to  the 
relative  simplicity  of  the  D-MAR  and  F-MAR  estimation 
algorithms:  they  are  not  iterative  and  require  only  a  single 
pass  over  the  data.  In  contrast,  with  EM-BP,  the  EM  algo¬ 
rithm  is  not  only  iterative,  but  the  BP  algorithm  that  EM-BP 
invokes  as  a  sub-routine,  is  itself  an  iterative  algorithm.  As 
for  accuracy,  F-MAR  typically  obtains  the  best  likelihoods 
(in  bold)  for  larger  datasets,  while  EM-BP  can  perform  bet¬ 
ter  on  smaller  datasets.  We  also  evaluated  D-MCAR  and 
F-MCAR,  although  they  are  not  in  general  consistent  for 
MAR  data.  We  find  that  they  scale  even  further,  and  can 
also  produce  good  estimates  in  terms  of  likelihood. 

5  EXPLOITING  MISSINGNESS  GRAPHS 

We  have  so  far  made  very  general  assumptions  about  the 
structure  of  the  missingness  graph,  capturing  the  MCAR 
and  MAR  assumptions.  In  this  section,  we  show  how  to 
exploit  additional  knowledge  about  the  missingness  graph 
to  further  improve  the  quality  of  our  estimates.  Having 
deeper  knowledge  of  the  nature  of  the  missingness  mech¬ 
anisms  will  even  enable  us  to  obtain  consistent  estimators 
for  datasets  that  are  not  MAR  (in  some  cases). 

5.1  INFORMED  DELETION  FOR  MAR 

Consider  any  MAR  dataset,  and  a  missingness  graph  where 
each  R  £  R  depends  every  observed  variable  in  X0. 
This  would  be  an  MAR  missingness  graph  that  assumes 


Table  2:  Log-likelihoods  of  large  networks,  with  higher  treewidths,  learned  from  MAR  data  (5  min.  time  limit). 


Size 

EM-JT 

EM-BP 

D-MCAR 

F-MCAR 

D-MAR 

F-MAR 

EM-JT 

EM-BP 

D-MCAR 

F-MCAR 

D-MAR 

F-MAR 
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o 

- 

-57.14 

-80.92 

-57.01 

-80.80 

-56.53 

-19.10 

-18.76 

-25.31 

-21.76 

-25.29 

-21.81 
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<N 

o 

- 

-65.41 

-38.54 

-30.07 

-38.27 

-29.86 
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<D 

- 

-14.73 
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-16.45 

-18.93 

-16.36 
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G\ 

T3 

- 

- 

-25.95 
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-25.36 

-22.88 

1 

- 

-20.70 
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-14.90 

-16.33 

-14.67 
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'G 

a 

- 

- 

-22.74 

-22.01 

-21.60 

- 

- 

- 

-15.49 

- 

-14.90 

- 
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- 
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-115.50 
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-115.41 
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- 

-89.22 

-89.54 

-89.26 

-89.60 

-89.14 

103 

.c 

- 

-69.03 

-71.01 

-65.91 

-70.61 

-65.51 

CD 

- 

-74.26 

-71.67 

-70.46 

-71.68 

-70.18 

104 

c 

3 

- 

-157.23 

-56.07 

-54.24 

-55.46 

- 

o3 

0Q 

- 

- 

-56.44 

-55.12 

-56.40 

- 

105 

- 

- 

-52.00 

- 

- 

- 

- 

- 

- 

- 

- 

- 

Table  3:  Log-likelihoods  of  large  networks,  with  higher  treewidths,  learned  from  MAR  data  (25  min.  time  limit). 


Size 

EM-JT 

EM-BP 

D-MCAR 

F-MCAR 

D-MAR 

F-MAR 

EM-JT 

EM-BP 

D-MCAR 

F-MCAR 

D-MAR 

F-MAR 

102 

- 

-49.15 

-80.00 

-56.45 

-79.81 

-55.94 

-18.88 

-18.73 

-25.84 

-22.11 

-25.87 

-22.25 

103 

o 

<N 

- 

-53.64 

-38.14 

-29.32 

-37.75 

-29.09 

ft 

-17.63 

-14.41 

-18.39 

-15.95 

-18.27 

-15.79 

104 

© 

On 

- 

-85.65 

-26.21 

-23.05 

-25.45 

-22.62 

CD 

- 

-14.52 

-15.57 

-14.07 

-15.24 

-13.92 

10s 

'Sh 

- 

- 

-22.78 

-21.54 

-21.60 

-20.79 

- 

-24.99 

-14.17 

-13.46 

-13.71 

-13.19 

10® 

a 

- 

- 

- 

- 

- 

- 

- 

- 

-13.73 

- 

- 

- 

102 

- 

-99.15 

-114.76 

-106.07 

-114.66 

-105.12 

-89.05 

-89.15 

-89.57 

-89.17 

-89.62 

-89.03 

103 

.c 

- 

-67.85 

-74.18 

-67.81 

-73.82 

-67.39 

<D 

- 

-70.38 

-71.86 

-70.54 

-71.87 

-70.27 

104 

c 

3 

- 

-66.62 

-57.50 

-54.94 

-56.96 

-54.64 

H 

o3 

CQ 

- 

-76.48 

-56.37 

-55.13 

-56.33 

- 

10s 

- 

- 

-53.07 

-51.66 

-52.27 

- 

- 

- 

-51.31 

- 

-51.19 

- 

the  least,  in  terms  of  conditional  independencies,  about  the 
causal  mechanisms  R.  If  we  know  more  about  the  nature  of 
the  missingness  (i.e.,  the  variables  that  the  R  depend  on), 
we  can  exploit  this  to  obtain  more  accurate  estimates.  Note 
that  knowing  the  parents  of  an  R  is  effectively  equivalent  to 
knowing  the  Markov  blanket  of  R  (Pearl,  1987),  which  can 
be  learned  from  data  (Tsamardinos,  Aliferis,  Statnikov,  & 
Statnikov,  2003;  Yaramakala  &  Margaritis,  2005).  With 
sufficient  domain  knowledge,  an  expert  may  be  able  to 
specify  the  parents  of  the  R.  It  suffices  even  to  identify 
a  set  of  variables  that  just  contains  the  Markov  blanket. 

Suppose  that  we  have  such  knowledge  of  the  missing  data 
mechanisms  of  an  MAR  problem,  namely  that  we  know 
the  subset  WQ  of  the  observed  variables  XQ  that  suffice  to 
separate  the  missing  values  from  their  causal  mechanisms, 
i.e.,  where  Xm  _LL  R  |  WQ.  We  can  exploit  this  knowl¬ 
edge  in  our  direct  deletion  algorithm,  to  obtain  improved 
parameter  estimates.  In  particular,  we  can  reduce  the  scope 
of  the  summation  in  our  direct  deletion  algorithm  from  the 
variables  XJ,  (the  set  of  variables  in  XQ  that  lie  outside  the 
family  Y),  to  the  variables  W'0  (the  set  of  variables  in  WQ 
that  lie  outside  the  family  Y),  yielding  the  algorithm: 

Pr(Y) 

«  E  Pr®(YJY0,  Wq,  RYm=ob)  Pr-D  (Y0 ,  W'J 

W' 

Again,  we  need  only  consider,  in  the  summation,  the  in¬ 
stantiations  of  that  appear  in  the  dataset. 


Table  4:  alarm  network  with  Informed  MAR  data 


Size 

F-MCAR 

D-MAR 

F-MAR 

ID-MAR  IF-MAR 

Kullback-Leibler  Divergence 

102 

1.921 

2.365 

2.364 

2.021 

2.011 

103 

0.380 

0.454 

0.452 

0.399 

0.375 

104 

0.073 

0.071 

0.072 

0.059 

0.053 

105 

0.041 

0.021 

0.022 

0.011 

0.010 

106 

0.040 

0.006 

0.008 

0.001 

0.001 

Test  Set  Log-Likelihood  (Fully  Observed) 

102 

-11.67 

-12.13 

-12.13 

-11.77 

-11.76 

103 

-10.40 

-10.47 

-10.47 

-10.42 

-10.40 

104 

-10.04 

-10.04 

-10.04 

-10.02 

-10.02 

105 

-10.00 

-9.98 

-9.98 

-9.97 

-9.97 

106 

-10.00 

-9.97 

-9.97 

-9.96 

-9.96 

We  refer  to  this  algorithm  as  informed  direct  deletion.  By 
reducing  the  scope  of  the  summation,  we  need  to  estimate 
fewer  sub-terms  Pri)(Y£JY0,  W'0,  RYm=ob).  This  re¬ 
sults  in  a  more  efficient  computation,  but  further,  each  in¬ 
dividual  sub-expression  can  be  estimated  on  more  data. 
Moreover,  our  estimates  remain  consistent.  We  can  simi¬ 
larly  replace  X0  by  WQ  in  the  factored  deletion  algorithm, 
to  obtain  an  informed  factored  deletion  algorithm. 

Empirical  Evaluation  Here,  we  evaluate  the  benefits  of 
informed  deletion.  In  addition  to  the  MAR  assumption, 
with  this  setting,  we  assume  that  we  know  the  set  of  par¬ 
ents  WQ  of  the  missingness  mechanism  variables.  To  gen- 


erate  data  for  such  a  mechanism,  we  select  a  random  set 
of  s  variables  to  form  WQ.  We  further  employ  the  sam¬ 
pling  algorithm  previously  used  for  MAR  data,  but  now 
insist  that  the  parents  of  R  variables  come  from  WQ.  Ta¬ 
ble  4  shows  likelihoods  and  KLDs  on  the  alarm  network, 
for  s  =  3,  and  other  settings  as  in  the  MAR  experiments. 
Informed  D-MAR  (ID-MAR)  and  F-MAR  (IF-MAR)  con¬ 
sistently  outperform  their  non-informed  counterparts. 

5.2  LEARNING  FROM  MNAR  DATA 

A  missing  data  problem  that  is  not  MAR  is  classified  as 
MNAR.  Here,  the  parameters  of  a  Bayesian  network  may 
not  even  be  identifiable.  Further,  maximum-likelihood  es¬ 
timation  is  in  general  not  consistent,  so  EM  and  gradient 
methods  can  yield  biased  estimates.  However,  if  one  knows 
the  mechanisms  that  dictate  missingness  (in  the  form  of 
a  missingness  graph),  it  becomes  possible  again  to  obtain 
consistent  estimates,  in  some  cases  (Mohan  et  ah,  2013). 

For  example,  consider  the  missingness  graph  of  Figure  2, 
which  is  an  MNAR  problem,  where  both  variables  X  and  Y 
are  partially  observed,  and  the  missingness  of  each  variable 
depends  on  the  value  of  the  other.  Here,  it  is  still  possible 
to  obtain  consistent  parameter  estimates,  as  Pr(X,  Y)  = 

Pr(i?x=ob,  Ry=  ob)  Pr(X* .  Y*\Rx=ob1  Ry=  ob) 
Pt(Rx=  ob\Y\  Ry=  ob)  Pr(f?F=ob| X*,  Rx=  ob) 

For  a  derivation,  see  Mohan  et  al.  (2013).  Such  derivations 
for  recovering  queries  under  MNAR  are  extremely  sensi¬ 
tive  to  the  structure  of  the  missingness  graph.  Indeed,  the 
class  of  missingness  graphs  that  admit  consistent  estima¬ 
tion  has  not  yet  been  fully  characterized. 

6  RELATED  WORK 

When  estimating  the  parameters  of  a  Bayesian  network, 
maximum-likelihood  (ML)  estimation  is  the  typical  ap¬ 
proach,  where  for  incomplete  data,  the  common  wisdom 
among  machine  learning  practitioners  is  that  one  needs 
to  use  Expectation-Maximization  (EM)  or  gradient  meth¬ 
ods  (Dempster,  Laird,  &  Rubin,  1977;  Lauritzen,  1995); 
see  also,  e.g.,  Darwiche  (2009),  Roller  and  Friedman 
(2009),  Murphy  (2012),  Barber  (2012).  Again,  such  meth¬ 
ods  do  not  scale  to  large  datasets  or  large  networks  as  (1) 
they  are  iterative,  (2)  they  suffer  from  local  optima,  and 
most  notably,  (3)  they  require  inference  in  a  Bayesian  net¬ 
work.  Considerable  effort  has  been  expended  in  improving 
on  EM  across  these  dimensions,  in  order  to,  for  example, 

(1)  accelerate  the  convergence  of  EM,  and  to  intelligently 
sample  subsets  of  a  dataset,  e.g.,  Thiesson  et  al.  (2001), 

(2)  escape  local  optima,  e.g.,  (Elidan,  Ninio,  Friedman,  & 
Shuurmans,  2002),  and  (3)  use  approximate  inference  algo¬ 
rithms  in  lieu  of  exact  ones  when  inference  is  intractable, 
e.g.,  Ghahramani  and  Jordan  (1997),  Caffo,  Jank,  and  Jones 


(2005).  Further,  while  EM  is  suitable  for  data  that  is  MAR 
(the  typical  assumption  in  practice),  there  are  some  excep¬ 
tions,  such  as  work  on  recommender  systems  that  explicitly 
incorporate  missing  data  mechanisms  (Marlin  &  Zemel, 
2009;  Marlin,  Zemel,  Roweis,  &  Slaney,  2007,  2011). 

In  the  case  of  complete  data,  the  parameter  estimation  task 
simplifies  considerably,  in  the  case  of  Bayesian  networks: 
maximum-likelihood  estimates  can  be  obtained  inference- 
free  and  in  closed-form,  using  just  a  single  pass  over  the 
data:  (J,|u  =  Pri)(a;|u).  In  fact,  the  estimation  algorithms 
that  we  proposed  in  this  paper  also  obtain  the  same  param¬ 
eter  estimates  in  the  case  of  complete  data,  although  we  are 
not  concerned  with  maximum-likelihood  estimation  here — 
we  simply  want  to  obtain  estimates  that  are  consistent  (as 
in  estimation  by  the  method  of  moments). 

Other  inference-free  estimators  have  been  proposed  for 
other  classes  of  graphical  models.  Abbeel,  Roller,  and 
Ng  (2006)  identified  a  method  for  closed-form,  inference- 
free  parameter  estimation  in  factor  graphs  of  bounded  de¬ 
gree  from  complete  data.  More  recently,  Halpern  and  Son- 
tag  (2013)  proposed  an  efficient,  inference-free  method 
for  consistently  estimating  the  parameters  of  noisy-or  net¬ 
works  with  latent  variables,  under  certain  structural  as¬ 
sumptions.  From  the  perspective  of  maximum-likelihood 
learning,  where  evaluating  the  likelihood  (requiring  infer¬ 
ence)  seems  to  be  unavoidable,  the  ability  to  consistently 
estimate  parameters — without  the  need  for  inference — 
greatly  extends  the  accessibility  and  utility  of  such  mod¬ 
els.  For  example,  it  opens  the  door  to  practical  structure 
learning  algorithms,  under  incomplete  data,  which  is  a  no¬ 
toriously  difficult  problem  in  practice  (Abbeel  et  al.,  2006; 
Jernite,  Halpern,  &  Sontag,  2013). 

7  CONCLUSIONS 

In  summary,  we  proposed  a  family  of  efficient  and  scal¬ 
able  algorithms  for  learning  the  parameters  of  Bayesian 
networks,  from  MCAR  and  MAR  datasets,  and  sometimes 
MNAR  datasets.  Our  parameter  estimates  are  asymptoti¬ 
cally  consistent,  and  further,  they  are  obtained  inference- 
free  and  in  closed-form.  We  further  introduced  and  dis¬ 
cussed  some  improved  approaches  for  parameter  estima¬ 
tion,  when  given  additional  knowledge  of  the  missingness 
mechanisms  underlying  an  incomplete  dataset.  Empiri¬ 
cally,  we  demonstrate  the  practicality  of  our  method,  show¬ 
ing  that  it  can  scale  to  much  larger  datasets,  and  much 
larger  Bayesian  networks,  than  EM. 
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